




! pip3 install numpy
! pip3 install pandas
! pip3 install matplotlib
! pip3 install seaborn
! pip3 install scikit-learn
Requirement already satisfied: numpy, pandas, matplotlib, seaborn, scikit-learn (and their dependencies)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from pandas import Series, DataFrame
from pylab import rcParams
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report
import seaborn as sns
sns.set_style('white')
sns.set_context('notebook')
# suppress warnings
import warnings
warnings.simplefilter('ignore')
%matplotlib inline
df = pd.read_csv("Diamonds Prices2022.csv")
df.sample(10)
|  | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 13139 | 13140 | 1.07 | Premium | F | SI1 | 62.5 | 59.0 | 5436 | 6.57 | 6.52 | 4.09 |
| 42970 | 42971 | 0.41 | Ideal | E | VVS1 | 60.9 | 55.0 | 1367 | 4.80 | 4.83 | 2.93 |
| 36285 | 36286 | 0.34 | Good | E | SI2 | 63.7 | 55.0 | 477 | 4.43 | 4.46 | 2.83 |
| 101 | 102 | 0.75 | Premium | E | SI1 | 59.9 | 54.0 | 2760 | 6.00 | 5.96 | 3.58 |
| 20422 | 20423 | 1.01 | Ideal | G | IF | 62.8 | 57.0 | 8778 | 6.42 | 6.39 | 4.02 |
| 30462 | 30463 | 0.31 | Ideal | D | SI1 | 62.7 | 56.0 | 732 | 4.36 | 4.32 | 2.72 |
| 18825 | 18826 | 1.70 | Very Good | J | SI1 | 59.1 | 61.0 | 7713 | 7.79 | 7.85 | 4.62 |
| 50956 | 50957 | 0.66 | Very Good | H | VVS1 | 61.9 | 59.0 | 2323 | 5.59 | 5.65 | 3.48 |
| 30488 | 30489 | 0.31 | Ideal | D | SI1 | 62.3 | 54.0 | 732 | 4.37 | 4.33 | 2.71 |
| 23203 | 23204 | 1.06 | Ideal | D | VVS2 | 61.1 | 56.0 | 11209 | 6.58 | 6.59 | 4.02 |
Let's check whether the dataset contains any missing (NaN) values.
# count the numbers of NaN values in each column
df.isnull().sum()
Unnamed: 0    0
carat         0
cut           0
color         0
clarity       0
depth         0
table         0
price         0
x             0
y             0
z             0
dtype: int64
The counts show there are no missing values to handle. The `Unnamed: 0` column is just a duplicated row index with no relation to the other features, so we can drop it.
# drop first column (unnamed)
df = df.drop(df.columns[0], axis=1)
df.head()
|  | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
# check the data types of each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53943 entries, 0 to 53942
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   carat    53943 non-null  float64
 1   cut      53943 non-null  object
 2   color    53943 non-null  object
 3   clarity  53943 non-null  object
 4   depth    53943 non-null  float64
 5   table    53943 non-null  float64
 6   price    53943 non-null  int64
 7   x        53943 non-null  float64
 8   y        53943 non-null  float64
 9   z        53943 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
Before looking at correlations between the features, we need to convert the categorical features (`cut`, `color`, `clarity`) into numerical ones. We map each category to an integer rank. For `cut`, a higher value means better quality; for `color` and `clarity`, a lower value means better quality. Better quality should mean a higher price, but by how much?
# convert categorical data to numerical data
cutMap = {'Ideal': 5, 'Premium': 4, 'Very Good': 3, 'Good': 2, 'Fair': 1}
colorMap = {'D': 1, 'E': 2, 'F': 3, 'G': 4, 'H': 5, 'I': 6, 'J': 7}
clarityMap = {'IF': 1, 'VVS1': 2, 'VVS2': 3, 'VS1': 4, 'VS2': 5, 'SI1': 6, 'SI2': 7, 'I1': 8}
df['cut'] = df['cut'].map(cutMap)
df['color'] = df['color'].map(colorMap)
df['clarity'] = df['clarity'].map(clarityMap)
df.head()
|  | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 5 | 2 | 7 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | 4 | 2 | 6 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | 2 | 2 | 4 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | 4 | 6 | 5 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | 2 | 7 | 7 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
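The manual dictionaries above work fine; the same ordinal encoding can also be expressed with scikit-learn's `OrdinalEncoder` by passing an explicit category order. A minimal sketch on a toy frame (toy data, not the diamonds file):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame for illustration (not the real dataset)
toy = pd.DataFrame({'cut': ['Ideal', 'Fair', 'Premium']})

# Explicit category order replaces the manual dict: 'Fair' -> 0, ..., 'Ideal' -> 4
enc = OrdinalEncoder(categories=[['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']])
toy['cut_code'] = enc.fit_transform(toy[['cut']]).astype(int) + 1  # shift to 1..5 like cutMap

print(toy['cut_code'].tolist())  # [5, 1, 4]
```

The explicit `categories` list also guards against a typo'd category silently mapping to NaN, which `Series.map` would do.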
# heatmap that shows the correlation between the different features
sns.heatmap(df.corr())
<Axes: >
# two side-by-side scatter plots: price vs depth and price vs table
fig, (axis1,axis2) = plt.subplots(1,2,figsize=(15,5))
axis1.scatter(df['depth'], df['price'])
axis1.set_title('Depth vs Price')
axis1.set_xlabel('Depth')
axis1.set_ylabel('Price')
axis2.scatter(df['table'], df['price'])
axis2.set_title('Table vs Price')
axis2.set_xlabel('Table')
Text(0.5, 0, 'Table')
#drop depth and table columns
dfCleaned = df.drop(['depth', 'table'], axis=1)
dfCleaned.head()
|  | carat | cut | color | clarity | price | x | y | z |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 5 | 2 | 7 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | 4 | 2 | 6 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | 2 | 2 | 4 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | 4 | 6 | 5 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | 2 | 7 | 7 | 335 | 4.34 | 4.35 | 2.75 |
#custom legend for cut column
cutLegend = {1: 'Fair', 2: 'Good', 3: 'Very Good', 4: 'Premium', 5: 'Ideal'}
# plot carat vs price, with cut hue and custom cutLegend
sns.lmplot(x='carat', y='price', data=dfCleaned, hue='cut', fit_reg=False, legend=False)
plt.legend(cutLegend.values())
<matplotlib.legend.Legend at 0x290f90bb0>
# three scatter plots side by side: x, y, and z vs price
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
ax1.scatter(dfCleaned['x'], dfCleaned['price'], picker=True)
ax1.set_title('x vs price')
ax1.set_xlabel('x')
ax1.set_ylabel('price')
ax2.scatter(dfCleaned['y'], dfCleaned['price'])
ax2.set_title('y vs price')
ax2.set_xlabel('y')
ax3.scatter(dfCleaned['z'], dfCleaned['price'])
ax3.set_title('z vs price')
ax3.set_xlabel('z')
Text(0.5, 0, 'z')
# one-hot encode the categorical columns into binary indicators using get_dummies
dfAdjusted = pd.get_dummies(dfCleaned, columns=['cut', 'color', 'clarity'])
dfAdjusted.head()
|  | carat | price | x | y | z | cut_1 | cut_2 | cut_3 | cut_4 | cut_5 | ... | color_6 | color_7 | clarity_1 | clarity_2 | clarity_3 | clarity_4 | clarity_5 | clarity_6 | clarity_7 | clarity_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 326 | 3.95 | 3.98 | 2.43 | False | False | False | False | True | ... | False | False | False | False | False | False | False | False | True | False |
| 1 | 0.21 | 326 | 3.89 | 3.84 | 2.31 | False | False | False | True | False | ... | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.23 | 327 | 4.05 | 4.07 | 2.31 | False | True | False | False | False | ... | False | False | False | False | False | True | False | False | False | False |
| 3 | 0.29 | 334 | 4.20 | 4.23 | 2.63 | False | False | False | True | False | ... | True | False | False | False | False | False | True | False | False | False |
| 4 | 0.31 | 335 | 4.34 | 4.35 | 2.75 | False | True | False | False | False | ... | False | True | False | False | False | False | False | False | True | False |
5 rows × 25 columns
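One caveat with `get_dummies` for linear models: the full set of indicators per feature is perfectly collinear (each row's indicators sum to 1). A sketch of the `drop_first=True` option on toy data:

```python
import pandas as pd

# Toy column standing in for the encoded 'cut' feature
toy = pd.DataFrame({'cut': [5, 4, 2]})

# drop_first=True removes one indicator per feature, breaking the
# "columns sum to 1" collinearity that can hurt linear models
dummies = pd.get_dummies(toy, columns=['cut'], drop_first=True)
print(list(dummies.columns))  # ['cut_4', 'cut_5']
```

Tree-based models are indifferent to the redundant column, so keeping all indicators, as above, is also fine.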
# split the data into training and testing sets
X = dfAdjusted.drop('price', axis=1)
y = dfAdjusted['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# standardize the data
scaler = preprocessing.StandardScaler().fit(X_train)
# transform the data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# create dataframe from the scaled data
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
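Fitting the scaler on the training set only and then transforming both sets, as above, is the right pattern; a `Pipeline` bundles the two steps so the scaler can never accidentally see test data. A sketch on synthetic data (synthetic X and y, not the diamonds features):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
# The pipeline fits the scaler on the training data only, then fits the model
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)
```

The pipeline is especially handy with cross-validation, where the scaler is refit inside each fold automatically.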
# Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# degree 8 over 24 features would generate millions of terms; degree 2 is already large
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X_train_scaled)
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y_train)
# Predicting new results with Polynomial Regression
y_pred = lin_reg.predict(poly_reg.transform(X_test_scaled))
# print the R^2 score of the model (price is continuous, so classification_report does not apply)
print(lin_reg.score(poly_reg.transform(X_test_scaled), y_test))
# Logistic Regression
# Note: logistic regression is a classifier; with a continuous target like price it
# treats every distinct price as its own class, so this is for illustration only
# (max_iter=1 keeps the run short, but the solver will not converge)
logreg = LogisticRegression(solver='sag', max_iter=1)
logreg.fit(X_train_scaled, y_train)
# Predicting a new result with Logistic Regression
y_pred = logreg.predict(X_test_scaled)
#print accuracy of the model
print(classification_report(y_test, y_pred))
# Ridge regression
from sklearn.linear_model import Ridge
# SAG is a stochastic optimization algorithm that is particularly useful for large-scale linear regression problems.
# SAGA is a variant of SAG that also supports the non-smooth L1 penalty
#ridgeReg = Ridge(alpha=0.05, solver='sag')
ridgeReg = Ridge(alpha=0.05, solver='saga')
ridgeReg.fit(X_train_scaled, y_train)
# Predicting a new result with Ridge Regression
y_pred = ridgeReg.predict(X_test_scaled)
#print accuracy of the model
print(ridgeReg.score(X_test_scaled, y_test))
0.922425670476217
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
# parameters that we want to tune
alpha = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20, 30, 50, 70, 100, 150, 200, 300, 500, 700, 1000, 1500, 2000]
ridge = Ridge()
parameters = {'alpha': alpha}
# GridSearchCV will try all the combinations of the parameters
ridge_regressor = GridSearchCV(ridge, parameters,scoring='neg_mean_squared_error', cv=5)
ridge_regressor.fit(X_train_scaled, y_train)
print(ridge_regressor.best_params_)
print(-ridge_regressor.best_score_)
{'alpha': 20}
1295574.3904235393
ridge = Ridge(alpha=ridge_regressor.best_params_['alpha'])
ridge.fit(X_train_scaled, y_train)
# Predicting a new result with Ridge Regression
y_pred = ridge.predict(X_test_scaled)
#print accuracy of the model
print(ridge.score(X_test_scaled, y_test))
0.922375573729045
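GridSearchCV works, but for ridge specifically `RidgeCV` does the same alpha sweep with an efficient built-in leave-one-out scheme and no explicit parameter grid object. A sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=150)

# RidgeCV evaluates every candidate alpha with efficient leave-one-out CV by default
alphas = [1e-3, 1e-2, 0.1, 1, 10, 100]
model = RidgeCV(alphas=alphas).fit(X, y)
best_alpha = model.alpha_
```

`model` is already refit on the full data with the winning alpha, so no second fit step is needed.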
#Lasso regression
from sklearn.linear_model import Lasso
lassoReg = Lasso(alpha=0.6)
lassoReg.fit(X_train_scaled, y_train)
# Predicting a new result with Lasso Regression
y_pred = lassoReg.predict(X_test_scaled)
#print accuracy of the model
print(lassoReg.score(X_test_scaled, y_test))
0.9223987609445108
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
# modify train and test data, with only x,y,z
X_train = X_train_scaled[['x', 'y', 'z']]
X_test = X_test_scaled[['x', 'y', 'z']]
# parameters that we want to tune
alpha = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 0.6, 1, 5, 10, 20, 30, 50, 70, 100, 150, 200, 300, 500, 700, 1000, 1500, 2000]
lassoReg = Lasso(alpha=0.6)
lassoReg.fit(X_train, y_train)
# Predicting a new result with Lasso Regression
y_pred = lassoReg.predict(X_test)
#print accuracy of the model
print(lassoReg.score(X_test, y_test))
0.7804924706294081
lasso = Lasso()
parameters = {'alpha': alpha}
# GridSearchCV will try all the combinations of the parameters
lasso_regressor = GridSearchCV(lasso, parameters,scoring='neg_mean_squared_error', cv=5)
lasso_regressor.fit(X_train, y_train)
print(lasso_regressor.best_params_)
print(-lasso_regressor.best_score_)
lasso = Lasso(alpha=lasso_regressor.best_params_['alpha'])
lasso.fit(X_train, y_train)
# Predicting a new result with Lasso Regression
y_pred = lasso.predict(X_test)
#print accuracy of the model
print(lasso.score(X_test, y_test))
{'alpha': 50}
3462425.630456732
0.7803486557847755
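`LassoCV` similarly picks alpha along a regularization path with cross-validation, and the fitted coefficients show lasso's hallmark sparsity. A sketch on synthetic data where only the first two features carry signal:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
# only the first two features matter; lasso should shrink the rest toward 0
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# LassoCV builds its own alpha path and cross-validates along it
model = LassoCV(cv=5, random_state=0).fit(X, y)
n_active = int(np.sum(np.abs(model.coef_) > 1e-3))
```

Inspecting which coefficients survive is a cheap form of feature selection, which lines up with the x/y/z-only experiments above.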
#importing the SGD Regressor
from sklearn.linear_model import SGDRegressor
# Fitting SGD Regressor to the Training set
sgd_reg = SGDRegressor()
sgd_reg.fit(X_train_scaled, y_train)
# Predicting a new result with SGD Regressor
y_pred = sgd_reg.predict(X_test_scaled)
#accuracy of the model SGD Regressor
print(sgd_reg.score(X_test_scaled, y_test))
0.9222990712101318
# keep only the x feature
X_train = X_train_scaled[['x']]
X_test = X_test_scaled[['x']]
# Fitting SGD Regressor to the Training set
sgd_reg = SGDRegressor()
sgd_reg.fit(X_train, y_train)
# Predicting a new result with SGD Regressor
y_pred = sgd_reg.predict(X_test)
#accuracy of the model SGD Regressor
print(sgd_reg.score(X_test, y_test))
0.780279120075163
# keep only the carat feature
X_train = X_train_scaled[['carat']]
X_test = X_test_scaled[['carat']]
# Fitting SGD Regressor to the Training set
sgd_reg = SGDRegressor()
sgd_reg.fit(X_train, y_train)
# Predicting a new result with SGD Regressor
y_pred = sgd_reg.predict(X_test)
#accuracy of the model SGD Regressor
print(sgd_reg.score(X_test, y_test))
0.7805163504962933
We can see that training on `x` alone or `carat` alone still gives a fair score, but obviously neither is the best model.
# Decision Tree
# A decision tree is a non-parametric supervised learning method used for classification and regression.
from sklearn.tree import DecisionTreeRegressor
dtree = DecisionTreeRegressor()
dtree.fit(X_train_scaled, y_train)
# Predicting a new result with Decision Tree
y_pred = dtree.predict(X_test_scaled)
#print accuracy of the model
print(dtree.score(X_test_scaled, y_test))
0.9649467463950754
# Random Forest Regression
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=100, random_state=0)
rfr.fit(X_train_scaled, y_train)
# Predicting a new result with Random Forest Regression
y_pred = rfr.predict(X_test_scaled)
#print accuracy of the model
print(rfr.score(X_test_scaled, y_test))
0.9784884722052908
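Random forests also offer a built-in validation estimate: with `oob_score=True`, each tree is scored on the samples left out of its bootstrap draw, giving an R² estimate without touching the test set. A sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.uniform(size=(300, 3))
y = 10 * X[:, 0] + rng.normal(scale=0.1, size=300)

# oob_score=True scores each tree on its out-of-bag samples,
# giving an internal R^2 estimate for free
rfr = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rfr.fit(X, y)
oob = rfr.oob_score_
```

The OOB score usually tracks the held-out test score closely, which makes it a useful sanity check during tuning.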
# Support Vector Machine model
from sklearn.svm import SVR
svr_model = SVR()
svr_model.fit(X_train_scaled, y_train)
# Predicting a new result with Support Vector Machine
y_pred = svr_model.predict(X_test_scaled)
#print accuracy of the model
print(svr_model.score(X_test_scaled, y_test))
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Splitting the dataset into the Training set and Test set
X = dfCleaned.drop('color', axis=1)
y = dfCleaned['color']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Feature Scaling
sc = StandardScaler()
# transform the data
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
# create dataframe from the scaled data
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
# KNN model for classification of the color
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
# Predicting a new result with KNN
y_pred = knn.predict(X_test_scaled)
#print accuracy of the model
print('KNN Classification Accuracy: ', knn.score(X_test_scaled, y_test))
KNN Classification Accuracy:  0.4326628973954954
error = []
# KNeighborsClassifier works best with few features; with up to 4 or 5 it performs really well
# Calculating error for K values from 1 to 59
for i in range(1, 60):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))  # mean error rate for this k
plt.figure(figsize=(12, 6))
plt.plot(range(1, 60), error, color='red', linestyle='dashed', marker='o', markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
Text(0, 0.5, 'Mean Error')
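Reading the best k off the elbow plot can be automated: the error list is indexed from k=1, so the argmin just needs a shift. A minimal sketch with hypothetical error values (not the actual numbers behind the plot):

```python
import numpy as np

# Hypothetical error rates from a k-sweep like the loop above
error = [0.55, 0.47, 0.44, 0.45, 0.46]

# error[0] corresponds to k=1, so the best k is the argmin shifted by 1
best_k = int(np.argmin(error)) + 1
print(best_k)  # 3
```

In practice it is worth checking a few k values around the argmin, since the error curve is noisy near its minimum.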
# KNN model for classification of the color
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_scaled, y_train)
# Predicting a new result with KNN
y_pred = knn.predict(X_test_scaled)
#print accuracy of the model
print('KNN Classification Accuracy: ', knn.score(X_test_scaled, y_test))
KNN Classification Accuracy:  0.44758550375382333
# SVM model for classification of the color
from sklearn.svm import SVC
svc_model = SVC()
svc_model.fit(X_train_scaled, y_train)
# Predicting a new result with SVM
y_pred = svc_model.predict(X_test_scaled)
#print accuracy of the model
print('SVM Classification Accuracy: ', svc_model.score(X_test_scaled, y_test))
SVM Classification Accuracy:  0.3821484845676152
# SVM with manually chosen C and gamma (note: it actually scores slightly worse than the defaults here)
from sklearn.svm import SVC
svc_model = SVC(C=1, gamma=0.1)
svc_model.fit(X_train_scaled, y_train)
# Predicting a new result with SVM
y_pred = svc_model.predict(X_test_scaled)
#print accuracy of the model
print('SVM Classification Accuracy: ', svc_model.score(X_test_scaled, y_test))
SVM Classification Accuracy:  0.3745481508944295
# Decision Tree
# A decision tree is a non-parametric supervised learning method used for classification and regression.
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train_scaled, y_train)
# Predicting a new result with Decision Tree
y_pred = dtree.predict(X_test_scaled)
#print accuracy of the model
print('Decision Tree Classification Accuracy: ', dtree.score(X_test_scaled, y_test))
Decision Tree Classification Accuracy:  0.5130225229400315
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train_scaled, y_train)
# Predicting a new result with Random Forest
y_pred = rfc.predict(X_test_scaled)
#print accuracy of the model
print('Random Forest Classification Accuracy: ', rfc.score(X_test_scaled, y_test))
Random Forest Classification Accuracy:  0.17582723143942905
# importance of the features in the Random Forest model
feature_imp = pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
<Axes: >
X = X_train_scaled.drop(['carat','clarity','cut'], axis=1)
X_test = X_test_scaled.drop(['carat','clarity','cut'], axis=1)
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X, y_train)
# Predicting a new result with Random Forest
y_pred = rfc.predict(X_test)
#print accuracy of the model
print('Random Forest Classification Accuracy: ', rfc.score(X_test, y_test))
Random Forest Classification Accuracy:  0.1738808045231254
# add volume of the diamond, approximating it as an ellipsoid: V = (4/3) * pi * (x * y * z)
# (x, y, z are full widths rather than semi-axes, so this overestimates the true
#  ellipsoid volume by a factor of 8; a constant factor does not affect the models)
dfCleaned['volume'] = (4/3) * np.pi * (dfCleaned['x'] * dfCleaned['y'] * dfCleaned['z'])
# add density of the diamond
dfCleaned['density'] = dfCleaned['carat'] / dfCleaned['volume']
# add price per carat
dfCleaned['price_per_carat'] = dfCleaned['price'] / dfCleaned['carat']
# add price per volume
dfCleaned['price_per_volume'] = dfCleaned['price'] / dfCleaned['volume']
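A quick sanity check of these engineered columns on toy rows (hypothetical values, but the same x/y/z semantics as the dataset):

```python
import numpy as np
import pandas as pd

# Toy rows with the dataset's column semantics (x, y, z are full widths in mm)
toy = pd.DataFrame({'carat': [0.5, 1.0], 'price': [1500, 6000],
                    'x': [5.0, 6.4], 'y': [5.0, 6.4], 'z': [3.1, 4.0]})

# Same formulas as above
toy['volume'] = (4 / 3) * np.pi * toy['x'] * toy['y'] * toy['z']
toy['price_per_carat'] = toy['price'] / toy['carat']
toy['price_per_volume'] = toy['price'] / toy['volume']

print(toy['price_per_carat'].tolist())  # [3000.0, 6000.0]
```

One thing to watch: rows with x, y, or z equal to 0 (they exist in some versions of this dataset) would give zero volume and infinite density, so it can be worth filtering them first.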
# Splitting the dataset into the Training set and Test set
X = dfCleaned.drop('color', axis=1)
y = dfCleaned['color']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Feature Scaling
sc = StandardScaler()
# Drop the new columns before scaling
X_train = X_train.drop(['density', 'price_per_volume'], axis=1)
X_test = X_test.drop(['density', 'price_per_volume'], axis=1)
# transform the data
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
# create dataframe from the scaled data
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
# Predicting a new result with Random Forest
y_pred = rfc.predict(X_test)
#print accuracy of the model
print('Random Forest Classification Accuracy: ', rfc.score(X_test, y_test))
# importance of the features in the Random Forest model
feature_imp = pd.Series(rfc.feature_importances_, index=X_train.columns).sort_values(ascending=False)
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
Random Forest Classification Accuracy:  0.6272129020298453
<Axes: >
sns.pairplot(dfCleaned, hue='color', palette='coolwarm')
<seaborn.axisgrid.PairGrid at 0x17df6ec40>
sns.pairplot(dfCleaned, hue='clarity', palette='coolwarm')
<seaborn.axisgrid.PairGrid at 0x17df5a970>
sns.pairplot(dfCleaned, hue='cut', palette='coolwarm')
<seaborn.axisgrid.PairGrid at 0x29f2cfa60>
# Clustering with K-Means
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(dfCleaned.drop('clarity', axis=1))
# view the cluster centeroids
print(kmeans.cluster_centers_)
# view the labels
print(kmeans.labels_)
# add a new column to the dataframe with the cluster labels
dfCleaned['cluster'] = kmeans.labels_
# plot the clusters
sns.lmplot(x='x', y='carat', data=dfCleaned, hue='cluster', palette='coolwarm', aspect=1, fit_reg=False)
[[4.92848617e-01 3.98960889e+00 3.35044988e+00 1.46154640e+03 4.99386077e+00 5.00082462e+00 3.08522615e+00 9.08843043e-04]
 [1.11737698e+00 3.72306084e+00 3.91341567e+00 5.79390642e+03 6.62044893e+00 6.61670130e+00 4.08931059e+00 1.99954257e+00]
 [1.71822589e+00 3.89469011e+00 4.15485704e+00 1.33608863e+04 7.63643758e+00 7.63806784e+00 4.70089505e+00 1.00000000e+00]]
[0 0 0 ... 0 0 0]
<seaborn.axisgrid.FacetGrid at 0x292802e50>
kmeans = KMeans(n_clusters=8)
kmeans.fit(dfCleaned.drop('clarity', axis=1))
# add a new column to the dataframe with the cluster labels
dfCleaned['cluster'] = kmeans.labels_
# side by side plots of the clusters and the original data
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,6))
ax1.set_title('KMeans')
sns.scatterplot(x='x', y='carat', data=dfCleaned, hue='cluster', palette='coolwarm', ax=ax1, legend=True)
ax2.set_title("Original")
sns.scatterplot(x='x', y='carat', data=dfCleaned, hue='clarity', palette='coolwarm', ax=ax2, legend=True)
<Axes: title={'center': 'Original'}, xlabel='x', ylabel='carat'>
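To pick the number of clusters less arbitrarily than jumping from 3 to 8, a common heuristic is the elbow of the inertia curve (within-cluster sum of squares) over k. A sketch on synthetic blobs, not the diamonds data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Three well-separated blobs in 2D
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

# Inertia drops sharply until k reaches the true number of clusters,
# then flattens -- the "elbow"
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]
```

On the real data the elbow is rarely as clean, but plotting `inertias` against k still gives a more defensible choice of `n_clusters`.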
# Clustering models: K-Means, Hierarchical Clustering, DBSCAN, Gaussian Mixture Model (GMM), Mean Shift, Spectral Clustering, Affinity Propagation,
# Agglomerative Clustering, Birch, Mini-Batch K-Means, OPTICS, and more.
# Clustering with Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=8)
agg.fit(dfCleaned.drop('clarity', axis=1))
# view the labels
print(agg.labels_)
# add a new column to the dataframe with the cluster labels
dfCleaned['cluster'] = agg.labels_
[5 5 5 ... 0 0 0]
# side by side plots of the clusters and the original data
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,6))
ax1.set_title('AgglomerativeClustering')
sns.scatterplot(x='price', y='carat', data=dfCleaned, hue='cluster', palette='coolwarm', ax=ax1)
ax2.set_title("Original")
sns.scatterplot(x='price', y='carat', data=dfCleaned, hue='clarity', palette='coolwarm', ax=ax2)
<Axes: title={'center': 'Original'}, xlabel='price', ylabel='carat'>
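AgglomerativeClustering builds a full merge tree; SciPy's `linkage` and `fcluster` expose the same Ward hierarchy and let you cut it at any level, which is handy for inspecting a dendrogram on a sample. A sketch on synthetic blobs:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
# Two well-separated blobs of 20 points each
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(20, 2)) for c in (0, 4)])

# Ward linkage (the AgglomerativeClustering default); fcluster then cuts
# the merge tree into a chosen number of flat clusters
Z = linkage(X, method='ward')
labels = fcluster(Z, t=2, criterion='maxclust')
```

Because `linkage` is O(n^2) in memory, running it on a random sample of the 54k rows is the practical way to visualize the hierarchy here.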
# fit DBSCAN first, so the plot actually shows DBSCAN labels (the 'cluster'
# column otherwise still holds the AgglomerativeClustering labels)
from sklearn.cluster import DBSCAN
dbscan = DBSCAN()
dfCleaned['cluster'] = dbscan.fit_predict(dfCleaned.drop('clarity', axis=1))
# side by side plots of the clusters and the original data
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,6))
ax1.set_title('DBSCAN Clustering')
sns.scatterplot(x='price', y='carat', data=dfCleaned, hue='cluster', palette='coolwarm', ax=ax1)
ax2.set_title("Original")
sns.scatterplot(x='price', y='carat', data=dfCleaned, hue='clarity', palette='coolwarm', ax=ax2)
<Axes: title={'center': 'Original'}, xlabel='price', ylabel='carat'>
import umap
# create a UMAP model with 2 dimensions
# n_neighbors default is 15
umapModel = umap.UMAP(n_components=2, n_neighbors=5, random_state=42, min_dist=0.1)
# fit the model to the data
manifold = umapModel.fit(dfCleaned.drop('clarity', axis=1))
#! pip install pandas matplotlib datashader bokeh holoviews colorcet scikit-image #pip
#! pip install umap-learn[plot]
import umap.plot
y = dfCleaned['clarity'].values.flatten()
# plot the UMAP embedding, coloring points by their clarity label
umap.plot.points(manifold, labels=y, theme="fire", width=1500, height=1000);
# Image size of 636x251086 pixels is too large. It must be less than 2^16 in each direction.
# plot UMAP embeddings for a grid of n_neighbors and min_dist values
fig, ax_array = plt.subplots(4, 4, figsize=(15, 15))
a = 0
b = 0
for n in [5, 10, 15, 20]:
    for d in [0.1, 0.25, 0.5, 0.75]:
        umapModel = umap.UMAP(n_components=2, n_neighbors=n, random_state=42, min_dist=d)
        manifold = umapModel.fit(dfCleaned.drop('clarity', axis=1))
        umap.plot.points(manifold, labels=y, theme="fire", width=1500, height=1000, ax=ax_array[a, b])
        ax_array[a, b].set_title(f"n_neighbors={n}, min_dist={d}")
        b += 1
    a += 1
    b = 0
# plot the last UMAP model from the grid, colored by clarity
umap.plot.points(manifold, labels=y, theme="fire", width=1500, height=1000);
/Users/petercatania/Library/Python/3.9/lib/python/site-packages/sklearn/manifold/_spectral_embedding.py:274: UserWarning: Graph is not fully connected, spectral embedding may not work as expected. warnings.warn( (repeated once per UMAP fit)